Cleaning¶
At this stage, we use the file we had saved in the input_data folder to create datetime variables,
[1]:
use "..\input_data\NYPD_Motor_Vehicle_Collisions",clear
Notice that we used ..\ here again. When this jupyter notebook is opened, its current directory is in the “dofiles” folder. Therefore, we use the relative paths to “move” to the “input_data” to get the dataset.
[2]:
notes list
_dta:
1. "Downloaded 9 Dec 2018 11:30:26"
browse in your Stata, to have an idea of the data
[3]:
*browse
Re-order variables¶
Personally, I prefer ID type variables to be in the first few columns of the dataset. We will use Stata’s order command to move the variable uniquekey to the first column.
[4]:
order(uniquekey), before(date)
Create date time variables¶
- Preview the date variable (through
list,codebook,browse(Data Editor)) - Determine the format of the date variable
- Is it a string, int, float, etc
- Determine if it is in other types of date format (e.g. Excel’s date format)
[5]:
list date in 1/4
+------------+
| date |
|------------|
1. | 11/16/2018 |
2. | 11/16/2018 |
3. | 11/16/2018 |
4. | 11/16/2018 |
+------------+
[6]:
codebook date
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
date DATE
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
type: string (str105)
unique values: 2,331 missing "": 0/1,386,003
examples: "03/24/2015"
"06/08/2017"
"08/15/2015"
"10/20/2013"
warning: variable has leading and embedded blanks
Now, we determine that the date variable is in a string format. We will use Stata’s HRF-to-SIF conversion functions in the datetime command.
[7]:
gen date_SIF = date(date, "MDY"),after(date)
(1 missing value generated)
We know that the date is a string variable, and it is in HRF( Human readable form) in the form of Month/Day/Year. We tell Stata’s date() function that through the mask "MDY". Then, I tell Stata to put date_SIF in the column after date.
[8]:
list date_SIF in 1/4
+----------+
| date_SIF |
|----------|
1. | 21504 |
2. | 21504 |
3. | 21504 |
4. | 21504 |
+----------+
The variable is in SIF(Stata internal form) now. So Stata understands it is a date. But in order for us to understand, we tell Stata to display this number in HRF through the format command. The date format is %td and the time format is %tc.
[9]:
format date_SIF %td
We repeat this for time variable.
[10]:
gen time_SIF = clock(time, "hm"),after(time)
format time_SIF %tc
(1 missing value generated)
Why do we want Stata to understand dates and times?¶
Because it allows us to do so much more! - view summary statistics of the date (e.g. first date, last date) - create year and month variables - create dummies based on durations - create dummies based on boolean logic
[11]:
sum date_SIF, format
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
date_SIF | 1,386,002 14oct2015 669.6435 01jul2012 16nov2018
[12]:
gen year_SIF = year(date_SIF) // Extract year from the date
(1 missing value generated)
[13]:
gen month_SIF = month(date_SIF) // Extract month from the date
(1 missing value generated)
[14]:
gen special_period = (date_SIF >= td(14feb2014) & date_SIF <= td(14march2014))
tab special_period
special_per |
iod | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,370,768 98.90 98.90
1 | 15,235 1.10 100.00
------------+-----------------------------------
Total | 1,386,003 100.00
Renaming long variable names¶
This is entirely up to you. I prefer a shorter variable name, and use labels to store the longer and more descriptive name. Labeling is the more important thing to do when you begin this process of building an analytical dataset.
[15]:
rename numberofcyclistinjured cyclistinjured
rename numberofcyclistkilled cyclistkilled
[16]:
label variable cyclistinjured "Number of cyclists injured"
label variable cyclistkilled "Number of cyclists killed"
Creating dummies from long strings¶
I will show you several methods to finding keywords from string variables. For example, we want to create a variable to indicate if a collision occurred with a bicycle. - strpos - contains exact “string”
- Explore the string variable that we want to extract information from.
[17]:
tab vehicletypecode1
VEHICLE TYPE CODE 1 | Freq. Percent Cum.
------------------------------------+-----------------------------------
(ceme | 1 0.00 0.00
15 Pa | 1 0.00 0.00
2 TON | 1 0.00 0.00
2 dr sedan | 6 0.00 0.00
3-Door | 30 0.00 0.00
315 e | 1 0.00 0.00
3D | 32 0.00 0.01
4 dr sedan | 202 0.01 0.02
4DSD | 3 0.00 0.02
4whee | 1 0.00 0.02
99999 | 1 0.00 0.02
ABULA | 1 0.00 0.02
ACCES | 2 0.00 0.02
AM | 292 0.02 0.04
AMABU | 1 0.00 0.04
AMB | 7 0.00 0.04
AMBU | 5 0.00 0.04
AMBUL | 111 0.01 0.05
AMBULANCE | 2,406 0.17 0.23
AMbul | 1 0.00 0.23
APORT | 1 0.00 0.23
APPOR | 1 0.00 0.23
AR | 41 0.00 0.23
ARMOR | 1 0.00 0.23
ARMY | 1 0.00 0.23
Ambul | 20 0.00 0.23
Ambulance | 265 0.02 0.25
Armored Truck | 42 0.00 0.25
BA | 6 0.00 0.25
BACK | 2 0.00 0.25
BACKH | 4 0.00 0.25
BED T | 1 0.00 0.25
BICYC | 1 0.00 0.25
BICYCLE | 5,567 0.40 0.66
BOAT | 1 0.00 0.66
BOBCA | 3 0.00 0.66
BOX | 1 0.00 0.66
BOX T | 5 0.00 0.66
BR | 42 0.00 0.66
BS | 2 0.00 0.66
BU | 2,229 0.16 0.82
BULLD | 2 0.00 0.82
BUS | 14,057 1.02 1.85
BUs | 1 0.00 1.85
Backh | 1 0.00 1.85
Beverage Truck | 31 0.00 1.85
Bike | 1,912 0.14 1.99
Boom | 1 0.00 1.99
Box | 1 0.00 1.99
Box Truck | 2,676 0.19 2.18
Box t | 1 0.00 2.18
Bucke | 1 0.00 2.18
Bulk Agriculture | 4 0.00 2.18
Bus | 1,992 0.14 2.33
CABIN | 1 0.00 2.33
CAMP | 1 0.00 2.33
CARGO | 2 0.00 2.33
CART | 1 0.00 2.33
CB | 89 0.01 2.34
CHERR | 1 0.00 2.34
CHEVY | 1 0.00 2.34
CM | 41 0.00 2.34
CMIX | 1 0.00 2.34
CO | 2 0.00 2.34
COM | 8 0.00 2.34
COM T | 1 0.00 2.34
COM. | 1 0.00 2.34
COMB | 1 0.00 2.34
COMME | 11 0.00 2.34
COMMM | 1 0.00 2.34
CONCR | 1 0.00 2.34
CONST | 1 0.00 2.34
CONV | 398 0.03 2.37
COUPE | 1 0.00 2.37
CRANE | 1 0.00 2.37
CUSHM | 1 0.00 2.37
Cargo | 1 0.00 2.37
Carry All | 178 0.01 2.38
Cat 9 | 1 0.00 2.38
Chassis Cab | 83 0.01 2.39
Cmix | 1 0.00 2.39
Comix | 1 0.00 2.39
Comm | 1 0.00 2.39
Comme | 1 0.00 2.39
Concrete Mixer | 67 0.00 2.39
Convertible | 428 0.03 2.42
DELIV | 14 0.00 2.43
DELV | 2 0.00 2.43
DEMA- | 1 0.00 2.43
DIRT | 1 0.00 2.43
DOT R | 1 0.00 2.43
DP | 367 0.03 2.45
DS | 1,006 0.07 2.53
DUMP | 17 0.00 2.53
DUMPT | 1 0.00 2.53
DUNBA | 1 0.00 2.53
Deliv | 1 0.00 2.53
Dump | 414 0.03 2.56
Dumps | 1 0.00 2.56
E BIK | 1 0.00 2.56
E ONE | 1 0.00 2.56
E SCO | 1 0.00 2.56
E-BIK | 4 0.00 2.56
E-Bik | 2 0.00 2.56
E-MOT | 1 0.00 2.56
E/BIK | 1 0.00 2.56
EBIKE | 2 0.00 2.56
ELECT | 15 0.00 2.56
EMRGN | 1 0.00 2.56
EMS A | 1 0.00 2.56
EMS H | 1 0.00 2.56
EN | 2 0.00 2.56
EPO | 1 0.00 2.56
Elect | 3 0.00 2.56
Enclosed Body - Removable Enclosure | 2 0.00 2.56
FB | 238 0.02 2.58
FD NY | 1 0.00 2.58
FD tr | 1 0.00 2.58
FDNY | 34 0.00 2.58
FED E | 2 0.00 2.58
FIRE | 25 0.00 2.58
FIRE TRUCK | 850 0.06 2.64
FIRET | 10 0.00 2.64
FLAT | 9 0.00 2.65
FLATB | 1 0.00 2.65
FORD | 4 0.00 2.65
FORK | 2 0.00 2.65
FORK- | 1 0.00 2.65
FORKL | 11 0.00 2.65
FR | 34 0.00 2.65
FREE | 1 0.00 2.65
FREIG | 6 0.00 2.65
FRONT | 1 0.00 2.65
Fd fi | 1 0.00 2.65
Fdny | 2 0.00 2.65
Fire | 15 0.00 2.65
Firet | 1 0.00 2.65
Flat Bed | 226 0.02 2.67
Flat Rack | 51 0.00 2.67
Ford | 1 0.00 2.67
Forkl | 2 0.00 2.67
Freig | 1 0.00 2.67
GARBA | 4 0.00 2.67
GE/SC | 1 0.00 2.67
GG | 205 0.01 2.69
GOLF | 1 0.00 2.69
GOVER | 1 0.00 2.69
GR | 2 0.00 2.69
GRAY | 1 0.00 2.69
Garba | 2 0.00 2.69
Garbage or Refuse | 223 0.02 2.70
Glass Rack | 1 0.00 2.70
Golf | 2 0.00 2.70
H/WH | 1 0.00 2.70
HEAVY | 1 0.00 2.70
HELP | 1 0.00 2.70
HO | 1 0.00 2.70
HORSE | 2 0.00 2.70
HWY C | 1 0.00 2.70
Hand | 1 0.00 2.70
Hopper | 3 0.00 2.70
Humme | 1 0.00 2.70
IP | 1 0.00 2.70
Ice C | 1 0.00 2.71
Inter | 1 0.00 2.71
John | 1 0.00 2.71
LARGE COM VEH(6 OR MORE TIRES) | 14,527 1.06 3.76
LF | 16 0.00 3.76
LIBER | 1 0.00 3.76
LIMO | 3 0.00 3.76
LIVER | 2 0.00 3.76
LIVERY VEHICLE | 10,481 0.76 4.53
LL | 169 0.01 4.54
LOADE | 1 0.00 4.54
LP | 1 0.00 4.54
LW | 2 0.00 4.54
Lift | 1 0.00 4.54
Lift Boom | 19 0.00 4.54
Light | 1 0.00 4.54
Limou | 1 0.00 4.54
Livestock Rack | 2 0.00 4.54
Log | 1 0.00 4.54
Lunch Wagon | 4 0.00 4.54
MACK | 2 0.00 4.54
MAIL | 4 0.00 4.54
MARK | 1 0.00 4.54
MB | 20 0.00 4.54
MD | 32 0.00 4.54
MH | 3 0.00 4.54
MILLI | 1 0.00 4.54
MINI | 3 0.00 4.54
MK | 1 0.00 4.54
MOPED | 10 0.00 4.55
MOTOR | 6 0.00 4.55
MOTORCYCLE | 6,536 0.48 5.02
MS | 42 0.00 5.02
MTA B | 4 0.00 5.02
Marke | 1 0.00 5.02
Mini | 1 0.00 5.02
Minibike | 7 0.00 5.03
Minicycle | 4 0.00 5.03
Mo pa | 1 0.00 5.03
Moped | 56 0.00 5.03
Motor | 2 0.00 5.03
Motorbike | 41 0.00 5.03
Motorcycle | 877 0.06 5.10
Motorized Home | 4 0.00 5.10
Motorscooter | 62 0.00 5.10
Multi-Wheeled Vehicle | 15 0.00 5.10
N/A | 1 0.00 5.10
NEW Y | 2 0.00 5.10
NS AM | 1 0.00 5.10
NYC B | 1 0.00 5.10
NYC M | 1 0.00 5.10
NYC a | 1 0.00 5.10
NYPD | 2 0.00 5.10
OIL T | 2 0.00 5.10
OML | 1 0.00 5.10
OMNIB | 1 0.00 5.10
OMR | 5 0.00 5.10
OMT | 2 0.00 5.10
OTHER | 23,972 1.74 6.85
Open Body | 5 0.00 6.85
P/SE | 1 0.00 6.85
P/SH | 2 0.00 6.85
PAS | 2 0.00 6.85
PASSENGER VEHICLE | 715,227 52.00 58.85
PEDIC | 1 0.00 58.85
PEDICAB | 35 0.00 58.85
PICK- | 1 0.00 58.85
PICK-UP TRUCK | 23,069 1.68 60.53
PICKU | 4 0.00 60.53
PK | 405 0.03 60.56
PL | 6 0.00 60.56
PM | 7 0.00 60.56
POSTA | 4 0.00 60.56
POSTO | 1 0.00 60.56
POWER | 7 0.00 60.56
PSD | 4 0.00 60.56
PSR | 1 0.00 60.56
PUMP | 1 0.00 60.56
Pallet | 7 0.00 60.56
Pedicab | 11 0.00 60.56
Pick | 1 0.00 60.56
Pick- | 1 0.00 60.56
Pick-up Truck | 3,883 0.28 60.85
Picku | 1 0.00 60.85
Pickup with mounted Camper | 5 0.00 60.85
Porta | 1 0.00 60.85
R/V | 1 0.00 60.85
R/V C | 1 0.00 60.85
REFRI | 3 0.00 60.85
RENTA | 1 0.00 60.85
REP | 2 0.00 60.85
RESCU | 1 0.00 60.85
RF | 56 0.00 60.85
RV | 11 0.00 60.85
RYDER | 1 0.00 60.85
Refrigerated Van | 58 0.00 60.86
SANIT | 6 0.00 60.86
SCHOO | 16 0.00 60.86
SCOOT | 10 0.00 60.86
SCOOTER | 256 0.02 60.88
SE | 1 0.00 60.88
SELF | 4 0.00 60.88
SEMI | 2 0.00 60.88
SEMI- | 1 0.00 60.88
SKATE | 3 0.00 60.88
SMALL COM VEH(4 TIRES) | 14,559 1.06 61.94
SMART | 1 0.00 61.94
SNOW | 1 0.00 61.94
SP | 20 0.00 61.94
SPC | 2 0.00 61.94
SPORT UTILITY / STATION WAGON | 313,498 22.79 84.73
SPRIN | 1 0.00 84.73
ST | 15 0.00 84.74
ST150 | 1 0.00 84.74
STAK | 5 0.00 84.74
STREE | 6 0.00 84.74
SUBN/ | 1 0.00 84.74
SUBUR | 1 0.00 84.74
SUV | 3 0.00 84.74
SWEEP | 1 0.00 84.74
Sanit | 1 0.00 84.74
Schoo | 1 0.00 84.74
School Bus | 9 0.00 84.74
Scoot | 1 0.00 84.74
Sedan | 56,607 4.12 88.85
Semi | 1 0.00 88.85
Semi- | 1 0.00 88.85
Skate | 1 0.00 88.85
Smart | 1 0.00 88.85
Snow Plow | 1 0.00 88.85
Spc | 1 0.00 88.85
Sprin | 2 0.00 88.85
Stake or Rack | 11 0.00 88.85
Station Wagon/Sport Utility Vehicle | 43,706 3.18 92.03
Subn | 1 0.00 92.03
Sweep | 1 0.00 92.03
TANK | 2 0.00 92.03
TANKE | 2 0.00 92.03
TAXI | 50,669 3.68 95.72
TK | 2,485 0.18 95.90
TN | 77 0.01 95.90
TOW | 3 0.00 95.90
TOW T | 13 0.00 95.90
TOWTR | 2 0.00 95.90
TR | 160 0.01 95.92
TRAC | 3 0.00 95.92
TRACK | 1 0.00 95.92
TRACT | 21 0.00 95.92
TRAFF | 1 0.00 95.92
TRAIL | 52 0.00 95.92
TRANS | 3 0.00 95.92
TRK | 2 0.00 95.92
TRL | 3 0.00 95.92
TRLPM | 1 0.00 95.92
TRLR | 2 0.00 95.92
TRUCK | 45 0.00 95.93
TT | 130 0.01 95.94
Tanker | 76 0.01 95.94
Taxi | 5,894 0.43 96.37
Tow | 1 0.00 96.37
Tow T | 1 0.00 96.37
Tow Truck | 4 0.00 96.37
Tow Truck / Wrecker | 125 0.01 96.38
Tow t | 1 0.00 96.38
Tow-t | 1 0.00 96.38
Tract | 3 0.00 96.38
Tractor Truck Diesel | 1,175 0.09 96.46
Tractor Truck Gasoline | 159 0.01 96.48
Trail | 5 0.00 96.48
Truck | 2 0.00 96.48
U.S. | 1 0.00 96.48
UHAUL | 4 0.00 96.48
UHUAL | 1 0.00 96.48
UKN | 1 0.00 96.48
UNKNO | 1 0.00 96.48
UNKNOWN | 19,929 1.45 97.93
UPS T | 1 0.00 97.93
UPS t | 1 0.00 97.93
US PO | 1 0.00 97.93
USPS | 16 0.00 97.93
UT | 1 0.00 97.93
UTIL | 4 0.00 97.93
UTILI | 6 0.00 97.93
UTV | 1 0.00 97.93
Unkno | 1 0.00 97.93
Util | 1 0.00 97.93
Utili | 2 0.00 97.93
VAN | 26,540 1.93 99.86
VAN T | 1 0.00 99.86
VAN/T | 2 0.00 99.86
VANG | 1 0.00 99.86
VC | 2 0.00 99.86
VN | 712 0.05 99.91
VT | 2 0.00 99.91
Van | 906 0.07 99.98
Van Camper | 4 0.00 99.98
Vanette | 1 0.00 99.98
Veriz | 1 0.00 99.98
WAGON | 1 0.00 99.98
WD | 1 0.00 99.98
WHBL | 1 0.00 99.98
WHEEL | 1 0.00 99.98
WHITE | 1 0.00 99.98
WORK | 2 0.00 99.98
Wagon | 1 0.00 99.98
Well Driller | 1 0.00 99.98
Wheel | 1 0.00 99.98
Work | 1 0.00 99.98
YELLO | 2 0.00 99.98
ambu | 5 0.00 99.98
ambul | 21 0.00 99.98
armor | 1 0.00 99.98
box t | 7 0.00 99.98
bulld | 1 0.00 99.98
bus | 7 0.00 99.98
cargo | 1 0.00 99.98
cate | 1 0.00 99.98
cemen | 1 0.00 99.98
chevo | 1 0.00 99.98
com | 2 0.00 99.98
comme | 3 0.00 99.98
deliv | 3 0.00 99.98
delv | 3 0.00 99.98
dp | 1 0.00 99.98
dsny | 1 0.00 99.98
dump | 5 0.00 99.98
e BIK | 1 0.00 99.98
e amb | 1 0.00 99.98
e com | 1 0.00 99.98
e sco | 1 0.00 99.98
e-bik | 2 0.00 99.98
east | 1 0.00 99.98
elect | 3 0.00 99.98
f550 | 1 0.00 99.98
fd tr | 1 0.00 99.98
fdny | 8 0.00 99.98
fire | 13 0.00 99.99
firet | 1 0.00 99.99
flat | 1 0.00 99.99
flatb | 1 0.00 99.99
ford | 2 0.00 99.99
fork | 1 0.00 99.99
forkl | 4 0.00 99.99
garba | 3 0.00 99.99
gator | 1 0.00 99.99
ice c | 1 0.00 99.99
icecr | 1 0.00 99.99
ladde | 1 0.00 99.99
light | 1 0.00 99.99
mail | 1 0.00 99.99
mcy | 1 0.00 99.99
mopad | 1 0.00 99.99
motor | 3 0.00 99.99
mta | 1 0.00 99.99
mta b | 2 0.00 99.99
nyc a | 1 0.00 99.99
omni | 1 0.00 99.99
p/sh | 1 0.00 99.99
pas | 1 0.00 99.99
passa | 1 0.00 99.99
pick | 4 0.00 99.99
posta | 2 0.00 99.99
power | 1 0.00 99.99
rd/s | 1 0.00 99.99
refg | 1 0.00 99.99
rv | 2 0.00 99.99
sanit | 2 0.00 99.99
schoo | 7 0.00 99.99
sciss | 1 0.00 99.99
scoot | 2 0.00 99.99
seagr | 1 0.00 99.99
self | 1 0.00 99.99
semi | 2 0.00 99.99
sgws | 1 0.00 99.99
spc p | 1 0.00 99.99
spec | 1 0.00 99.99
stree | 1 0.00 99.99
subn | 3 0.00 99.99
tk | 1 0.00 99.99
tow | 2 0.00 99.99
tow t | 3 0.00 99.99
trac | 1 0.00 99.99
tract | 6 0.00 99.99
trail | 17 0.00 99.99
trk | 1 0.00 99.99
trlr | 1 0.00 99.99
truck | 9 0.00 99.99
uhaul | 1 0.00 99.99
ulili | 1 0.00 99.99
unk | 3 0.00 99.99
unkno | 4 0.00 99.99
ups t | 1 0.00 99.99
usps | 2 0.00 99.99
utili | 6 0.00 99.99
van | 75 0.01 100.00
van t | 1 0.00 100.00
vol | 1 0.00 100.00
------------------------------------+-----------------------------------
Total | 1,375,333 100.00
- Find some key words. E.g. BICYC, BICYCLE, Bike
[18]:
gen bicyclerelated = vehicletypecode1 == "BICYC" | ///
vehicletypecode1 == "BICYCLE" | ///
vehicletypecode1 == "Bike"
[19]:
tab bicyclerelated
bicyclerela |
ted | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,378,523 99.46 99.46
1 | 7,480 0.54 100.00
------------+-----------------------------------
Total | 1,386,003 100.00
The above method requires that vehicletypecode1 have exactly those characters within the double quotation marks. If “BICYC” were spelt in lower capitals, the command above will not capture it. Hence, we can “improve” upon the code by using lower() or upper() functions to ensure that the strings are in the right form.
[20]:
gen bicyclerelated_2 = upper(vehicletypecode1) == "BICYC" | ///
upper(vehicletypecode1) == "BICYCLE" | ///
upper(vehicletypecode1) == "BIKE"
[21]:
tab bicyclerelated_2
bicyclerela |
ted_2 | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,378,523 99.46 99.46
1 | 7,480 0.54 100.00
------------+-----------------------------------
Total | 1,386,003 100.00
The method above is tedious because it requires that you look for every single keyword that is related to a bicycle. For example, in order to identify Minibike, you need to know that it exists in vehicletypecode1. We can use Stata’s strpos function to look within a string for a keyword. If strpos finds the keyword, the function will return the value of the position in the string. But instead of storing the position, we will use a boolean logic to identify that observation and create the
dummy variable.
[22]:
gen bicyclerelated_3 = strpos(upper(vehicletypecode1),"BICYC" ) > 0 | ///
strpos(upper(vehicletypecode1),"BIKE" ) > 0
[23]:
tab bicyclerelated_3
bicyclerela |
ted_3 | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,378,473 99.46 99.46
1 | 7,530 0.54 100.00
------------+-----------------------------------
Total | 1,386,003 100.00
Next, we can also use the screening command. This command is a little more complicated but it allows for spelling mistakes to be captured as well.
[24]:
screening ,sources(vehicletypecode1, upper) keys(BICYCLE BIKE) letters(5 3) explore(count)
Source | Key | Freq. | Percent
--------------------+-------+------------+---------
vehicletypecode1 | BICYC | 5568 | 73.84
| BIK | 1973 | 26.16
--------------------+-------+------------+---------
| Total | 7541 | 100.00
[25]:
screening ,sources(vehicletypecode1, upper) keys(BICYCLE BIKE) letters(5 3) newcode(bicyclerelated_4)
[26]:
tab bicyclerelated_4
bicyclerela |
ted_4 | Freq. Percent Cum.
------------+-----------------------------------
1 | 5,568 73.84 73.84
2 | 1,973 26.16 100.00
------------+-----------------------------------
Total | 7,541 100.00
Explore the coodinates¶
[27]:
codebook latitude
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
latitude LATITUDE
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [0,43.344444] units: 1.000e-06
unique values: 70,056 missing .: 236,251/1,386,003
mean: 40.7115
std. dev: .706688
percentiles: 10% 25% 50% 75% 90%
40.619 40.6687 40.7229 40.7674 40.837
[28]:
codebook longitude
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
longitude LONGITUDE
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [-201.35999,0] units: 1.000e-06
unique values: 48,842 missing .: 236,251/1,386,003
mean: -73.9102
std. dev: 1.72118
percentiles: 10% 25% 50% 75% 90%
-74.003 -73.9782 -73.9311 -73.8678 -73.8049
When we examine the codebook outputs. It is helpful to look at the unique values and the missing. From this we know that these coordinates are not all unique and that there are missing observations in the data.
We will try to fill in the missing latitude and logitude based on other characteristics about the intersection. But first we will create a dummy variable so that we can keep track of observations with missing longitude and latitudes.
[29]:
gen mi_latlong = mi(latitude) & mi(longitude)
Now we can see the number of missing over the years
[30]:
tab mi_latlong year_SIF, column
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
| year_SIF
mi_latlong | 2012 2013 2014 2015 2016 2017 2018 | Total
-----------+-----------------------------------------------------------------------------+----------
0 | 85,451 171,915 172,726 182,957 136,345 212,986 187,372 | 1,149,752
| 84.99 84.39 83.84 84.04 59.71 92.84 93.58 | 82.95
-----------+-----------------------------------------------------------------------------+----------
1 | 15,089 31,809 33,302 34,734 92,019 16,437 12,860 | 236,250
| 15.01 15.61 16.16 15.96 40.29 7.16 6.42 | 17.05
-----------+-----------------------------------------------------------------------------+----------
Total | 100,540 203,724 206,028 217,691 228,364 229,423 200,232 | 1,386,002
| 100.00 100.00 100.00 100.00 100.00 100.00 100.00 | 100.00
The column option tells stata to provide percentages by each year.
There are several key variables that we can use to determine if the intersections are the same. - borough - zipcode - onstreetname - crossstreetname
We will use Stata’s variable[_n-1] code to compare between rows of the observations. The idea is that if the borough, zipcode, onstreetname and crossstreetname of the previous row of observation is the same as the borough of the current row of observation, then the intersection has to be the same. Therefore, if the latitude and longitude are missing, we can fill it in with the non-missing latitude and longitude from the previous row.
First, we want to sort the data in order.
[31]:
sort onstreetname crossstreetname latitude longitude
[32]:
replace latitude = latitude[_n-1] if ///
mi(latitude) & ///
(borough == borough[_n-1]) & ///
(zipcode == zipcode[_n-1]) & ///
(onstreetname == onstreetname[_n-1]) & ///
(crossstreetname == crossstreetname[_n-1])
replace longitude = longitude[_n-1] if ///
mi(longitude) & ///
(borough == borough[_n-1]) & ///
(zipcode == zipcode[_n-1]) & ///
(onstreetname == onstreetname[_n-1]) & ///
(crossstreetname == crossstreetname[_n-1])
(74,517 real changes made)
(74,517 real changes made)
We managed to “back fill” quite a large number of observations! Now, we want to identify observations that still have missing latitude and longitudes.
[33]:
gen mi_latlong2 = mi(latitude) & mi(longitude)
[34]:
tab mi_latlong2 year_SIF, column
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
mi_latlong | year_SIF
2 | 2012 2013 2014 2015 2016 2017 2018 | Total
-----------+-----------------------------------------------------------------------------+----------
0 | 86,719 174,277 175,092 185,272 188,838 220,422 193,649 | 1,224,269
| 86.25 85.55 84.98 85.11 82.69 96.08 96.71 | 88.33
-----------+-----------------------------------------------------------------------------+----------
1 | 13,821 29,447 30,936 32,419 39,526 9,001 6,583 | 161,733
| 13.75 14.45 15.02 14.89 17.31 3.92 3.29 | 11.67
-----------+-----------------------------------------------------------------------------+----------
Total | 100,540 203,724 206,028 217,691 228,364 229,423 200,232 | 1,386,002
| 100.00 100.00 100.00 100.00 100.00 100.00 100.00 | 100.00
[35]:
tab mi_latlong year_SIF, column
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
| year_SIF
mi_latlong | 2012 2013 2014 2015 2016 2017 2018 | Total
-----------+-----------------------------------------------------------------------------+----------
0 | 85,451 171,915 172,726 182,957 136,345 212,986 187,372 | 1,149,752
| 84.99 84.39 83.84 84.04 59.71 92.84 93.58 | 82.95
-----------+-----------------------------------------------------------------------------+----------
1 | 15,089 31,809 33,302 34,734 92,019 16,437 12,860 | 236,250
| 15.01 15.61 16.16 15.96 40.29 7.16 6.42 | 17.05
-----------+-----------------------------------------------------------------------------+----------
Total | 100,540 203,724 206,028 217,691 228,364 229,423 200,232 | 1,386,002
| 100.00 100.00 100.00 100.00 100.00 100.00 100.00 | 100.00
There are other ways to identify the latitude and longitude of these missing variables through a process called Geocoding. But we will explore that another time.
For now, we will drop observations that are still missing.
[36]:
drop if mi_latlong2 == 1
(161,734 observations deleted)
There are also some observations that have coordinates that are outside the boundaries of New York City. We will drop them as well.
[37]:
drop if longitude < -74.5 | longitude > -73
drop if latitude < 40 | latitude > 41
(607 observations deleted)
(8 observations deleted)
Since, there are many observations, I will pull out a random sample of 2% of the data to quickly visualize. In Stata, we use preserve to keep the data at that point in memory and manipulate the data to do calculations/obtain weights/collapse/etc. Then, after we obtain our numbers/graphs/etc, we will use store to “get back” to the dataset that was at the point of preserve. We typically use preserve and restore commands when collapsing data.
[40]:
*preserve
sample 2
twoway scatter latitude longitude, msymbol(o) msize(tiny)
*restore
(1,199,181 observations deleted)
[38]:
display "$S_DATE $S_TIME"
notes: "Cleaned $S_DATE $S_TIME"
notes list
7 May 2019 14:55:13
_dta:
1. "Downloaded 9 Dec 2018 11:30:26"
2. "Cleaned 7 May 2019 14:55:13"
[39]:
*save "..\working_data\NYPD_Motor_Vehicle_Collisions_clean.dta"